Polymorphism detection utilizing clustering analysis

ABSTRACT

Systems and methods for detecting differences in sample polymers, such as nucleic acid sequences, are provided. Hybridization affinity information for the sample polymers is clustered so that the differences, if any, between or among the sample polymers can be readily identified. By clustering the hybridization affinity information of the sample polymers, differences in the sample polymers can be accurately detected even in the presence of random and systematic errors.

[0001] This application claims the benefit of U.S. Provisional Application No. 60/055,939, filed Aug. 15, 1997, and is a continuation of U.S. patent application Ser. No. 09/134,758, filed Aug. 14, 1998, now issued as U.S. Pat. No. ______, both of which are hereby incorporated by reference.

BACKGROUND OF THE INVENTION

[0002] The present invention relates to detecting differences in polymers. More specifically, the present invention relates to detecting polymorphisms in sample nucleic acid sequences by clustering hybridization affinity information.

[0003] Devices and computer systems for forming and using arrays of materials on a chip or substrate are known. For example, PCT applications WO 92/10588 and 95/11995, both incorporated herein by reference for all purposes, describe techniques for sequencing or sequence checking nucleic acids and other materials. Arrays for performing these operations may be formed according to the methods of, for example, the pioneering techniques disclosed in U.S. Pat. Nos. 5,445,934, 5,384,261 and 5,571,639, each incorporated herein by reference for all purposes.

[0004] According to one aspect of the techniques described therein, an array of nucleic acid probes is fabricated at known locations on a chip. A labeled nucleic acid is then brought into contact with the chip and a scanner generates an image file indicating the locations where the labeled nucleic acids are bound to the chip. Based upon the image file and identities of the probes at specific locations, it becomes possible to extract information such as the nucleotide or monomer sequence of DNA or RNA. Such systems have been used to form, for example, arrays of DNA that may be used to study and detect mutations relevant to genetic diseases, cancers, infectious diseases, HIV, and other genetic characteristics.

[0005] The VLSIPS™ technology provides methods of making very large arrays of oligonucleotide probes on very small chips. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated by reference for all purposes. The oligonucleotide probes on the DNA probe array are used to detect complementary nucleic acid sequences in a sample nucleic acid of interest (the “target” nucleic acid).

[0006] For sequence checking applications, the chip may be tiled for a specific target nucleic acid sequence. As an example, the chip may contain probes that are perfectly complementary to the target sequence and probes that differ from the target sequence by a single base mismatch. For de novo sequencing applications, the chip may include all the possible probes of a specific length. The probes are tiled on a chip in rows and columns of cells, where each cell includes multiple copies of a particular probe. Additionally, “blank” cells may be present on the chip which do not include any probes. As the blank cells contain no probes, labeled targets should not bind specifically to the chip in this area. Thus, a blank cell provides a measure of the background intensity.

[0007] The interpretation of hybridization data from hybridized chips can encounter several difficulties. Random errors, such as physical defects on the chip, can cause individual probes or spatially related groups of probes to exhibit abnormal hybridization (e.g., by abnormal fluorescence). Systematic errors, such as the formation of secondary structures in the probes or the target, can also cause reproducible, but still misleading, hybridization data.

[0008] For many applications, it is desirable to determine if there are differences between and among sample nucleic acid sequences, such as polymorphisms at a base position. It would be desirable to have systems and methods of detecting these differences in a way that is not overly affected by random and systematic errors.

SUMMARY OF THE INVENTION

[0009] The present invention provides innovative systems and methods for detecting differences in sample polymers, such as nucleic acid sequences. Hybridization affinity information for the sample polymers is clustered so that the differences, if any, between or among the sample polymers can be readily identified. By clustering the hybridization affinity information of the sample polymers, differences in the sample polymers can be accurately detected even in the presence of random and systematic errors. Additionally, polymorphisms can be detected in sample nucleic acids regardless of what basecalling has reported. Several embodiments of the invention are described below.

[0010] In one embodiment, the invention provides a method of detecting differences in sample polymers. Multiple sets of hybridization affinity information are input, where each set of hybridization affinity information includes hybridization affinities between a sample polymer and polymer probes. The multiple sets of hybridization affinity information are clustered into multiple clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster. The multiple clusters can then be analyzed to detect if there are differences in the sample polymers. For example, if the clustering does not produce clusters whose members are very similar to each other yet very different from the members of other clusters, this can indicate that the sample polymers are the same. Otherwise, the sample polymers can be different.

[0011] In another embodiment, the invention provides a method of detecting polymorphisms in sample nucleic acid sequences. Multiple sets of hybridization affinity information are input, where each set of hybridization affinity information includes hybridization affinities between a sample nucleic acid sequence and nucleic acid probes. The multiple sets of hybridization affinity information are hierarchically clustered into a plurality of clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster. The multiple clusters can then be analyzed to detect if there are polymorphisms in the sample polymers. The polymorphisms can include mutations, insertions and deletions.

[0012] Other features and advantages of the invention will become readily apparent upon review of the following detailed description in association with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1 illustrates an example of a computer system that may be utilized to execute the software of an embodiment of the invention.

[0014] FIG. 2 illustrates a system block diagram of the computer system of FIG. 1.

[0015] FIG. 3 illustrates an overall system for forming and analyzing arrays of biological materials such as DNA or RNA.

[0016] FIG. 4 illustrates conceptually the binding of probes on chips.

[0017] FIG. 5 shows a high level flowchart of a process of analyzing sample polymers.

[0018] FIG. 6 shows a flowchart of a process of clustering hybridization affinity data.

[0019] FIG. 7 shows a flowchart of a process of analyzing sample nucleic acid sequences.

[0020] FIG. 8 shows graphically how normalization can affect the hybridization affinities.

[0021] FIG. 9 illustrates a screen display including a dendrogram indicating that there does not appear to be a polymorphism at the base position of interest (SEQ ID NO:1; SEQ ID NO:2; SEQ ID NO:3; SEQ ID NO:4; SEQ ID NO:5; and SEQ ID NO:6).

[0022] FIG. 10 shows the dendrogram of FIG. 9.

[0023] FIG. 11 illustrates a dendrogram indicating that there is likely a polymorphism at the base position of interest.

[0024] FIG. 12 illustrates a dendrogram indicating that there is likely more than one polymorphism at the base position of interest.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0025] In the description that follows, the present invention will be described in reference to preferred embodiments that utilize VLSIPS™ technology for making very large arrays of oligonucleotide probes on chips. However, the invention is not limited to nucleic acids or to this technology and may be advantageously applied to other polymers and manufacturing processes. Therefore, the description of the embodiments that follows is for purposes of illustration and not limitation.

[0026] FIG. 1 illustrates an example of a computer system that may be used to execute the software of an embodiment of the invention. FIG. 1 shows a computer system 1 that includes a display 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11 may have one or more buttons for interacting with a graphical user interface. Cabinet 7 houses a CD-ROM drive 13, system memory and a hard drive (see FIG. 2) which may be utilized to store and retrieve software programs incorporating computer code that implements the invention, data for use with the invention, and the like. Although a CD-ROM 15 is shown as an exemplary computer readable storage medium, other computer readable storage media including floppy disk, tape, flash memory, system memory, and hard drive may be utilized. Additionally, a data signal embodied in a carrier wave (e.g., in a network including the Internet) may be the computer readable storage medium.

[0027] FIG. 2 shows a system block diagram of computer system 1 used to execute the software of an embodiment of the invention. As in FIG. 1, computer system 1 includes monitor 3, keyboard 9, and mouse 11. Computer system 1 further includes subsystems such as a central processor 51, system memory 53, fixed storage 55 (e.g., hard drive), removable storage 57 (e.g., CD-ROM drive), display adapter 59, sound card 61, speakers 63, and network interface 65. Other computer systems suitable for use with the invention may include additional or fewer subsystems. For example, another computer system could include more than one processor 51 (i.e., a multi-processor system) or a cache memory.

[0028] The system bus architecture of computer system 1 is represented by arrows 67. However, these arrows are illustrative of any interconnection scheme serving to link the subsystems. For example, a local bus could be utilized to connect the central processor to the system memory and display adapter. Computer system 1 shown in FIG. 2 is but an example of a computer system suitable for use with the invention. Other computer architectures having different configurations of subsystems may also be utilized.

[0029] For purposes of illustration, the present invention is described as being part of a computer system that designs a chip mask, synthesizes the probes on the chip, labels the nucleic acids, and scans the hybridized nucleic acid probes. Such a system is fully described in U.S. Pat. No. 5,571,639 that has been incorporated by reference for all purposes. However, the present invention may be used separately from the overall system for analyzing data generated by such systems.

[0030] FIG. 3 illustrates a computerized system for forming and analyzing arrays of biological materials such as RNA or DNA. A computer 100 is used to design arrays of biological polymers such as RNA and DNA. The computer 100 may be, for example, an appropriately programmed Sun Workstation or personal computer or workstation, such as an IBM PC equivalent, including appropriate memory and a CPU as shown in FIGS. 1 and 2. The computer system 100 obtains inputs from a user regarding characteristics of a gene of interest, and other inputs regarding the desired features of the array. Optionally, the computer system may obtain information regarding a specific genetic sequence of interest from an external or internal database 102 such as GenBank. The output of the computer system 100 is a set of chip design computer files 104 in the form of, for example, a switch matrix, as described in PCT application WO 92/10092, and other associated computer files.

[0031] The chip design files are provided to a system 106 that designs the lithographic masks used in the fabrication of arrays of molecules such as DNA. The system or process 106 may include the hardware necessary to manufacture masks 110 and also the computer hardware and software 108 necessary to lay the mask patterns out on the mask in an efficient manner. As with the other features in FIG. 3, such equipment may or may not be located at the same physical site but is shown together for ease of illustration in FIG. 3. The system 106 generates masks 110 or other synthesis patterns such as chrome-on-glass masks for use in the fabrication of polymer arrays.

[0032] The masks 110, as well as selected information relating to the design of the chips from system 100, are used in a synthesis system 112. Synthesis system 112 includes the necessary hardware and software used to fabricate arrays of polymers on a substrate or chip 114. For example, synthesizer 112 includes a light source 116 and a chemical flow cell 118 on which the substrate or chip 114 is placed. Mask 110 is placed between the light source and the substrate/chip, and the two are translated relative to each other at appropriate times for deprotection of selected regions of the chip. Selected chemical reagents are directed through flow cell 118 for coupling to deprotected regions, as well as for washing and other operations. All operations are preferably directed by an appropriately programmed computer 119, which may or may not be the same computer as the computer(s) used in mask design and mask making.

[0033] The substrates fabricated by synthesis system 112 are optionally diced into smaller chips and exposed to marked targets. The targets may or may not be complementary to one or more of the molecules on the substrate. The targets are marked with a label such as a fluorescein label (indicated by an asterisk in FIG. 3) and placed in scanning system 120. Although preferred embodiments utilize fluorescent markers, other markers may be utilized that provide differences in radioactive intensity, light scattering, refractive index, conductivity, electroluminescence, or other large molecule detection data. Therefore, the present invention is not limited to analyzing fluorescence measurements of hybridization but may be readily utilized to analyze other measurements of hybridization.

[0034] Scanning system 120 again operates under the direction of an appropriately programmed digital computer 122, which also may or may not be the same computer as the computers used in synthesis, mask making, and mask design. The scanner 120 includes a detection device 124 such as a confocal microscope or CCD (charge-coupled device) that is used to detect the location where labeled target (*) has bound to the substrate. The output of scanner 120 is an image file(s) 124 indicating, in the case of fluorescein labeled target, the fluorescence intensity (photon counts or other related measurements, such as voltage) as a function of position on the substrate. Since higher photon counts will be observed where the labeled target has bound more strongly to the array of polymers (e.g., DNA probes on the substrate), and since the monomer sequence of the polymers on the substrate is known as a function of position, it becomes possible to determine the sequence(s) of polymer(s) on the substrate that are complementary to the target.

[0035] The image file 124 is provided as input to an analysis system 126 that incorporates the synthesis integrity evaluation techniques of the present invention. Again, the analysis system may be any one of a wide variety of computer system(s), but in a preferred embodiment the analysis system is based on a WINDOWS NT workstation or equivalent. The analysis system may analyze the image file(s) to generate appropriate output 128, such as the identity of specific mutations in a target such as DNA or RNA.

[0036] FIG. 4 illustrates the binding of a particular target DNA to an array of DNA probes 114. As shown in this simple example, the following probes are formed in the array:

3′-AGAACGT
   AGACCGT
   AGAGCGT
   AGATCGT
   •
   •
   •

[0037] As shown, when the fluorescein-labeled (or otherwise marked) target 5′-TCTTGCA is exposed to the array, it is complementary only to the probe 3′-AGAACGT, and fluorescein will be primarily found on the surface of the chip where 3′-AGAACGT is located. The chip contains cells that include multiple copies of a particular probe and the cells may be square regions on the chip.
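The complementarity in this example can be checked with a minimal Python sketch. The function names `complement` and `matching_probes` are hypothetical and introduced only for illustration; the sketch assumes plain Watson-Crick pairing and perfect matches, and does not model partial hybridization.

```python
# Watson-Crick pairing table for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq):
    """Return the base-by-base complement, read in the opposite polarity.

    A target written 5'->3' maps to its complementary strand written 3'->5'.
    """
    return "".join(COMPLEMENT[base] for base in seq)


def matching_probes(target_5to3, probes_3to5):
    """Return the probes (written 3'->5') perfectly complementary to the target."""
    wanted = complement(target_5to3)
    return [probe for probe in probes_3to5 if probe == wanted]


# The four probes and the target named in the FIG. 4 example.
probes = ["AGAACGT", "AGACCGT", "AGAGCGT", "AGATCGT"]
print(matching_probes("TCTTGCA", probes))  # ['AGAACGT']
```

Only the first probe is a perfect complement, which is why the label is expected primarily in that cell.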

[0038] FIG. 5 is a high level flowchart of a process of analyzing sample polymers, such as nucleic acid sequences. At a step 201, sets of hybridization affinity information are input to a computer system. The hybridization affinity information can be in any number of forms including fluorescent, radioactive or other data. The hybridization affinity information can be utilized without modification as input for clustering analysis. However, the variations in the data can be reduced by normalizing the data.

[0039] The hybridization affinity information of each set is normalized at a step 203. Normalization can be utilized to provide more consistent data between and within experiments. As an example, normalization can include dividing each hybridization affinity value by the sum of all the hybridization affinity values in the set, thus reducing each hybridization affinity value to a value between 0 and 1. Although normalization can be beneficial in some applications, it is not required. Therefore, the steps shown in the flowcharts illustrate specific embodiments and steps can be deleted, inserted, combined, and modified within the spirit and scope of the invention.
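A sum-based normalization of this kind might look like the following Python sketch. It assumes non-negative intensity values and divides each value by the total of the whole set, which is the reading under which every result lies between 0 and 1; the function name `normalize_by_sum` is hypothetical.

```python
def normalize_by_sum(values):
    """Scale a set of non-negative affinities so that they sum to 1.

    Assumption: each value is divided by the total of the whole set,
    so every result falls between 0 and 1.
    """
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize a set whose values sum to zero")
    return [v / total for v in values]


# Illustrative photon counts for one set (made-up numbers).
print(normalize_by_sum([10.0, 30.0, 60.0]))
```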

[0040] At a step 205, the sets of hybridization affinity information are clustered. Clustering analysis processes typically accept as input multiple patterns of data (e.g., represented by vectors of floating point numbers) and rearrange the patterns into clusters of similar patterns. Preferred embodiments arrange patterns of data into hierarchical clusters where each cluster includes clusters that are more similar to each other than to other clusters.

[0041] Once the clusters are formed, they can be displayed on the screen for a user to analyze at a step 207. In addition to displaying the clusters, the computer system can also interpret the clusters and output to the user the number of distinct clusters that were found. The description of FIG. 5 has been provided at a high level to give the reader an initial understanding of the invention and the description that follows will describe the invention in more detail.

[0042] FIG. 6 shows a flowchart of a process of clustering hybridization affinity data. At a step 301, a check is performed to see if the sets of hybridization affinity information have been clustered into a single root cluster. A cluster can include one or more subclusters and a root cluster is a cluster that is not included in any other cluster. In the description that follows, a cluster (or subcluster) can be a single set of hybridization affinity information or include multiple sets.

[0043] Initially, each set of hybridization affinity information is considered a single cluster. As the clustering continues, clusters that are found to be similar enough are grouped together into a new cluster. When it is determined that all the sets of hybridization affinity information are clustered into a single root cluster at a step 303, the clustering is done.

[0044] Otherwise, the two closest clusters are found at a step 305. By being closest, it is meant that a metric indicates that two of the clusters include data that are more similar to each other than any of the other clusters are to another cluster. Any number of different metrics can be utilized including the Euclidean distance described in more detail in reference to FIG. 7. Most preferably, the metric satisfies the triangle inequality such that f(a,c)<=f(a,b)+f(b,c) for any set of data patterns {a,b,c}.

[0045] In the embodiments described herein, a cluster includes up to two sets of hybridization affinity information. However, there is no requirement that the clusters be limited in this manner. For example, the invention can be advantageously applied to clusters that can include up to three or more sets of hybridization affinity information by an extension of the principles described herein.

[0046] At a step 307, a new cluster is created that includes the two closest clusters. In order to compare the new cluster with other clusters, a value should be calculated to represent the data in the new cluster. In one embodiment, the average of the two closest clusters is computed for the new cluster at a step 309. After the new cluster has been created, the flow proceeds to step 301 to check if only one root cluster remains.
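The loop of FIG. 6 can be sketched in Python as follows. This is an illustrative reading of steps 301 through 309, not the actual implementation: the dictionary layout (`value`, `children`, `height`, `label`) and the function names are assumptions, and the Euclidean distance is used as the default metric.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(patterns, distance=euclidean):
    """Merge the two closest clusters repeatedly until one root remains."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    # Initially, each set of affinities is its own cluster.
    clusters = [
        {"value": list(p), "children": [], "height": 0.0, "label": i}
        for i, p in enumerate(patterns)
    ]
    while len(clusters) > 1:  # steps 301/303: stop at a single root cluster
        # Step 305: find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: distance(
                clusters[ij[0]]["value"], clusters[ij[1]]["value"]
            ),
        )
        a, b = clusters[i], clusters[j]
        # Steps 307/309: new cluster represented by the average of the two.
        merged = {
            "value": [(x + y) / 2 for x, y in zip(a["value"], b["value"])],
            "children": [a, b],
            "height": distance(a["value"], b["value"]),
            "label": None,
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


root = cluster([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
print(root["height"])
```

The exhaustive pair search is quadratic per merge, which is adequate for the handful of patterns in these examples; a larger data set would call for a dedicated library routine.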

[0047] FIG. 7 shows a flowchart of a process of analyzing sample nucleic acid sequences. For this embodiment, hybridization data from a chip with both sense and anti-sense probes are utilized. Fragments from the sense and anti-sense strands of a target are labeled and exposed to the chip resulting in four hybridization affinity measurements for the sense strand and four hybridization affinity measurements for the anti-sense strand at each interrogation position.

[0048] As an example, if the sense strand of a target sequence (or portion thereof) is 5′-GTAACGTTG then the following sense probes would interrogate the underlined base position (here, the central C):

5′-AAAGT
5′-AACGT
5′-AAGGT
5′-AATGT

[0049] The anti-sense strand of the target sequence (or portion thereof) would be 3′-CATTGCAAC and the following anti-sense probes would interrogate the underlined base position (here, the central G) for the anti-sense strand:

3′-TTACA
3′-TTCCA
3′-TTGCA
3′-TTTCA

[0050] Accordingly, in this embodiment, there are eight hybridization affinities, one for each probe, for each interrogation position.
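The two probe sets listed for the sense and anti-sense strands can be reproduced with a short Python sketch. The function name `interrogation_probes`, the 0-based `position` argument, and the odd probe `length` are assumptions made for illustration. The sketch writes each probe the same way the text lists it (the window around the interrogated base with each of the four bases substituted at the center) and does not model probe/target complementarity.

```python
BASES = "ACGT"


def interrogation_probes(strand, position, length=5):
    """Return four probes centered on `position`, one per substituted base.

    `position` is the 0-based index of the interrogated base in `strand`;
    `length` is assumed to be odd so that the probe has a central base.
    """
    half = length // 2
    if position - half < 0 or position + half >= len(strand):
        raise ValueError("probe window does not fit inside the strand")
    window = strand[position - half: position + half + 1]
    return [window[:half] + base + window[half + 1:] for base in BASES]


sense = interrogation_probes("GTAACGTTG", 4)      # written 5'->3'
antisense = interrogation_probes("CATTGCAAC", 4)  # written 3'->5'
print(sense)      # ['AAAGT', 'AACGT', 'AAGGT', 'AATGT']
print(antisense)  # ['TTACA', 'TTCCA', 'TTGCA', 'TTTCA']
```

Four probes per strand give the eight hybridization affinities recorded for each interrogation position.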

[0051] At a step 401, sets of hybridization affinity information are input to a computer system. This can include reading a file that includes hybridization affinity data for each base position that is interrogated in the target. As discussed above, the hybridization affinity data for a base position can include eight measured hybridization affinities. The eight measured hybridization affinities can be stored as a set or pattern of eight values (e.g., photon counts) such as {A₁, A₂, . . . , A₈}.

[0052] The hybridization affinity information of each set is normalized at a step 403. Normalizing the hybridization affinity information can de-emphasize differences that are not directly related to target sequence composition. One effective strategy for normalizing the hybridization affinities of a set is to first calculate the average of the hybridization affinities for a set and subtract this average from each hybridization affinity in the set. Then, each average-subtracted hybridization affinity is divided by the square root of the sum of the squares of the average-subtracted hybridization affinities of the set. In other words, the following formula is utilized to normalize each hybridization affinity of a set:

$$A_I = \frac{A_I - \bar{A}}{\sqrt{(A_1 - \bar{A})^2 + (A_2 - \bar{A})^2 + \cdots + (A_8 - \bar{A})^2}}$$

[0053] where I is from 1 to 8 and $\bar{A}$ is the average of A₁, A₂, . . . , A₈.
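The normalization formula can be expressed as a small Python function. This is a sketch under the stated formula only; the name `normalize` is hypothetical, and the guard against a constant pattern (where the denominator is zero) is an added assumption not discussed in the text.

```python
import math


def normalize(affinities):
    """Subtract the set average, then divide by the root sum of squares.

    The result has zero mean and unit Euclidean length.
    """
    mean = sum(affinities) / len(affinities)
    centered = [a - mean for a in affinities]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        # Assumption: a constant pattern carries no shape information.
        raise ValueError("all affinities are equal; pattern cannot be normalized")
    return [c / norm for c in centered]


# Illustrative eight-value pattern (made-up photon counts).
pattern = normalize([120.0, 900.0, 150.0, 130.0, 100.0, 700.0, 110.0, 140.0])
print(pattern)
```

Because the average is subtracted and the result rescaled, adding a constant background to every value or multiplying every value by a positive factor leaves the normalized pattern unchanged.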

[0054] FIG. 8 shows graphically how the normalization can affect the hybridization affinities. Hybridization affinities 451 are the raw data measured from the chip and the height of the bars indicates the relative measured hybridization affinity.

[0055] Average-subtracted hybridization affinities 453 show that the hybridization affinities are now vectors in two possible directions. The average-subtracted hybridization affinities are combined into an intermediate vector pattern 455. Normalization of vector pattern 455 is completed by dividing each vector by the denominator above to produce a final normalized vector pattern 457.

[0056] Normalization can correct for varying backgrounds and overall hybridization affinity values, while preserving the rank of each hybridization affinity within the set as well as the difference in overall hybridization affinity between the sense and anti-sense probes. Additionally, by normalizing the set of eight values in the manner described, the distance between any two patterns is bounded by (0,2), thus offering a consistent scale on which pattern differences can be evaluated.
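The bound on the distance can be checked with a short derivation, sketched here under the assumptions that the distance is Euclidean and that both patterns have been normalized as described, so that each has unit length. The symbols a and b for the two normalized patterns are introduced here only for illustration:

$$D_{ab}^2 = \sum_{I=1}^{8} (a_I - b_I)^2 = \sum_{I=1}^{8} a_I^2 + \sum_{I=1}^{8} b_I^2 - 2\sum_{I=1}^{8} a_I b_I = 2 - 2\sum_{I=1}^{8} a_I b_I$$

By the Cauchy-Schwarz inequality, the sum of products of two unit-length patterns lies between -1 and 1, so

$$0 \le D_{ab} \le 2$$

with 0 reached for identical patterns and 2 reached for exactly opposite patterns.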

[0057] Returning to FIG. 7, at a step 405, the sets of hybridization affinity information are hierarchically clustered. Any number of clustering algorithms can be utilized. In preferred embodiments, a modification of the mean linkage clustering algorithm is utilized. The value of a cluster that includes only a single set of hybridization affinities is the pattern of eight hybridization affinities. The value of a cluster C that includes two clusters A and B is as follows:

$$C_I = \mathrm{average}(A_I, B_I)$$

[0058] where I is from 1 to 8. Thus, each cluster is represented by an eight value pattern. Other linkage calculations can be utilized including traditional mean linkage wherein the mean of the distances between each member of a pattern is utilized. Additionally, the greatest (or least) distance between two members of two clusters can be utilized as the linkage formula.

[0059] The distance between two clusters is typically determined by a distance metric. Many different distance metrics can be utilized including the Euclidean distance, city-block distance, correlation distance, angular distance, and the like. Most preferably, the Euclidean distance is utilized and it is calculated as follows:

$$D_{AB} = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2 + \cdots + (A_8 - B_8)^2}$$

[0060] where I is from 1 to 8. The city-block distance can be calculated as follows:

$$D_{AB} = |A_1 - B_1| + |A_2 - B_2| + \cdots + |A_8 - B_8|$$

[0061] where I is from 1 to 8 and |X| represents the absolute value of X.
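Both distance metrics translate directly into code. The following Python sketch uses hypothetical function names and works for patterns of any equal length, not only eight values.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of values")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_distance(a, b):
    """Sum of the absolute element-wise differences."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same number of values")
    return sum(abs(x - y) for x, y in zip(a, b))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))   # 5.0
print(city_block_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
```

Both functions satisfy the triangle inequality, so either can serve as the metric used to find the two closest clusters.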

[0062] At a step 407, the number of “tight” clusters is counted. A “tight” cluster is defined as any cluster where the average distance from the cluster mean to the means of its subclusters is less than the distance to its nearest sibling cluster by a similarity factor (e.g., a factor of 3). It is fairly easy for a user to visually identify clusters, but the number of tight clusters can be utilized as a calculated determination of the number of clusters. If there are two or more tight clusters, the interrogation position is likely to be polymorphic. It should be noted that increasing the number of dimensions in an input pattern strongly reduces the probability that two patterns will be similar by chance and the value of the similarity factor can be adjusted accordingly.
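One possible reading of the tight-cluster count is sketched below in Python. Several points are assumptions rather than statements from the text: a cluster is counted when its average distance to its subclusters, multiplied by the similarity factor, is still smaller than its distance to the nearest sibling; single-pattern clusters and the root (which has no sibling) are never counted; and the helper names `leaf`, `merge`, and `count_tight` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leaf(pattern):
    """A cluster holding a single pattern."""
    return {"value": list(pattern), "children": []}


def merge(a, b):
    """A cluster whose value is the element-wise average of its two subclusters."""
    value = [(x + y) / 2 for x, y in zip(a["value"], b["value"])]
    return {"value": value, "children": [a, b]}


def count_tight(root, factor=3.0, distance=euclidean):
    """Count clusters that are much closer to their subclusters than to a sibling."""
    count = 0
    stack = [(root, [])]  # pairs of (cluster, its sibling clusters)
    while stack:
        cluster, siblings = stack.pop()
        kids = cluster["children"]
        if kids and siblings:
            spread = sum(distance(cluster["value"], k["value"]) for k in kids) / len(kids)
            gap = min(distance(cluster["value"], s["value"]) for s in siblings)
            if spread * factor < gap:
                count += 1
        for k in kids:
            stack.append((k, [o for o in kids if o is not k]))
    return count


# Two well-separated groups of two patterns each (made-up values).
wild = merge(leaf([0.0, 0.0]), leaf([0.0, 1.0]))
mutant = merge(leaf([10.0, 10.0]), leaf([10.0, 11.0]))
root = merge(wild, mutant)
print(count_tight(root))  # 2
```

Under this reading, a count of two or more would flag the interrogation position as likely polymorphic; the factor of 3 is the example value from the text and would be tuned to the number of dimensions in the pattern.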

[0063] The clusters are displayed at a step 409. The clusters can be displayed in any number of ways, but in preferred embodiments, they are displayed as dendrograms. Dendrograms are diagrams that represent the clusters. The distance between the clusters can be represented on the dendrogram so that the user can more readily identify the clusters that would be indicative of a polymorphism such as a mutation, insertion or deletion. In other words, the distance between the clusters varies with the similarity of the clusters.

[0064] As an example, FIG. 9 illustrates a screen display including a dendrogram indicating that there does not appear to be a polymorphism at the base position of interest. A screen display 501 includes a dendrogram 503. The dendrogram will be described in more detail in reference to FIG. 10.

[0065] Screen display 501 includes raw data 505 and the indicated base calls. A plot 507 of hybridization affinities vs. base position is shown for both the sense and anti-sense strands for pattern recognition. A table 509 includes information on base positions for the chip. Additionally, an image 511 provides information for mutant fraction estimation. Dendrogram 503 (and others) will be the focus of the following paragraphs.

[0066] FIG. 10 shows a dendrogram from FIG. 9 that clusters eight sets of hybridization affinity information (represented by the target name). A visual inspection of dendrogram 503 reveals that the distances between the clusters (illustrated by the horizontal lengths of the dendrogram) are relatively constant. This indicates that the patterns are relatively constant and therefore, it does not appear likely there is a polymorphism at the interrogation position.

[0067] FIG. 11 illustrates a dendrogram indicating that there is likely a polymorphism at the base position of interest. Dendrogram 603 shows the clustering of eight sets of hybridization affinity information. A visual inspection of the dendrogram reveals that there appear to be two clusters 605 and 607 where the distance between members of one cluster is much less than the distance between members of other clusters. As the patterns fall in two clusters, there is likely a polymorphism at the interrogation position.

[0068] As another example, FIG. 12 illustrates a screen display including a dendrogram indicating that there is likely more than one polymorphism at the base position of interest. A dendrogram 703 shows the clustering of eight sets of hybridization affinity information. A visual inspection of the dendrogram reveals that there appear to be three clusters 705, 707 and 709 where the distance between members of one cluster is much less than the distance between members of other clusters. Since the patterns fall in three clusters, there are likely two polymorphisms at the interrogation position.

[0069] With the invention, phenomena that are not obvious through examination of a single hybridization reaction can be detected. Conversely, the number and diversity of probes for recognizing a particular class of phenomena can be reduced. For example, mutations in the BRCA gene are so diverse that constructing a set of probes that would cover every possible polymorphism may be impractical. However, the invention may be utilized to detect such polymorphisms even in the absence of such probes.

[0070] In addition, clustering can be utilized to analyze or evaluate the effectiveness of experimental systems, such as genotyping chips, in which useful results are dependent on the detection of a fixed number of highly reproducible classes in the resulting data. In the case of genotyping, one expects three tightly clustered result classes representing homozygous wildtype, homozygous mutant and heterozygote genotypes, respectively. Metrics computed on the hierarchy of patterns generated by a clustering algorithm can provide a quantitative assessment of the specificity and reproducibility of the genotyping process.

[0071] While the above is a complete description of preferred embodiments of the invention, various alternatives, modifications, and equivalents may be used. It should be evident that the invention is equally applicable by making appropriate modifications to the embodiments described above. For example, the invention has been described in reference to nucleic acid probes that are synthesized on a chip. However, the invention may be advantageously applied to other monomers (e.g., amino acids and saccharides) and other hybridization techniques including those where the probes are not attached to a substrate. Therefore, the above description should not be taken as limiting the scope of the invention that is defined by the metes and bounds of the appended claims along with their full scope of equivalents.

SEQUENCE LISTING

Number of sequences: 6

SEQ ID NO: 1 (40 base pairs; nucleic acid; single; linear)
TTTAATTTTT TTAGGATGTG GGATTTAATT CATCATTGGC    40

SEQ ID NO: 2 (40 base pairs; nucleic acid; single; linear)
TTTAATTTTT TTAGGATGTN GGATTTAATT CATCATTTCC    40

SEQ ID NO: 3 (40 base pairs; nucleic acid; single; linear)
TTTAATTTTT TTAGNATGTN GGATTTAATT CATCATTTCC    40

SEQ ID NO: 4 (40 base pairs; nucleic acid; single; linear)
TTTAATTTTT TTAGNATGTN GNATTTAATT CATCATTTCC    40

SEQ ID NO: 5 (40 base pairs; nucleic acid; single; linear)
TTTAATTTTT TTAGNATGTA GNATTTAATT CATCATTTNC    40

SEQ ID NO: 6 (40 base pairs; nucleic acid; single; linear)
TTTAATTTTT TTAGGATGTA GGATTTAATT CATCATTNNC    40

What is claimed is:
 1. A method of detecting differences in sample polymers, comprising: inputting a plurality of sets of hybridization affinity information, each set of hybridization affinity information including hybridization affinities between a sample polymer and polymer probes; clustering the plurality of sets of hybridization affinity information into a plurality of clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster; and analyzing the plurality of clusters to detect if there are differences in the sample polymers.
 2. The method of claim 1, wherein the clustering the plurality of sets of hybridization affinity information includes calculating mean linkage clustering of the clusters.
 3. The method of claim 2, wherein the mean linkage clustering of the probes utilizes a distance metric for differences between clusters.
 4. The method of claim 3, wherein the distance metric is a Euclidean distance or a city-block distance.
 5. The method of claim 1, further comprising displaying a tree structure of the plurality of clusters.
 6. The method of claim 5, wherein the distance between the clusters varies with the similarity of the clusters.
 7. The method of claim 1, wherein the sample polymers include nucleic acids, amino acids or saccharides.
 8. A computer program product that detects differences in sample polymers, comprising: computer code that receives a plurality of sets of hybridization affinity information, each set of hybridization affinity information including hybridization affinities between a sample polymer and polymer probes; computer code that clusters the plurality of sets of hybridization affinity information into a plurality of clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster; computer code that analyzes the plurality of clusters to detect if there are differences in the sample polymers; and a computer readable medium that stores the computer codes.
 9. The computer program product of claim 8, wherein the computer readable medium is selected from the group consisting of floppy disk, tape, flash memory, system memory, hard drive, and a data signal embodied in a carrier wave.
 10. A method of detecting polymorphisms in sample nucleic acid sequences, comprising: inputting a plurality of sets of hybridization affinity information, each set of hybridization affinity information including hybridization affinities between a sample nucleic acid sequence and nucleic acid probes; hierarchically clustering the plurality of sets of hybridization affinity information into a plurality of clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster; and analyzing the plurality of clusters to detect if there are polymorphisms in the sample polymers.
 11. The method of claim 10, wherein the sample nucleic acid sequence and nucleic acid probes include both sense and anti-sense strands.
 12. The method of claim 11, wherein the hybridization affinity information includes four hybridization affinities for the sense strands and four hybridization affinities for the anti-sense strands.
 13. The method of claim 12, wherein the four hybridization affinities for the sense strands represent hybridization affinities between nucleic acid probes that differ by at least a nucleic acid at an interrogation position.
 14. The method of claim 12, wherein the four hybridization affinities for the anti-sense strands represent hybridization affinities between nucleic acid probes that differ by at least a nucleic acid at an interrogation position.
 15. The method of claim 10, wherein the polymorphisms include mutations, deletions and insertions at an interrogation position.
 16. The method of claim 10, further comprising normalizing the hybridization affinity information for each set.
 17. The method of claim 16, wherein the normalizing the hybridization affinity information for each set includes subtracting an average hybridization affinity from the hybridization affinities and dividing each hybridization affinity by a square root of the sum of squares of the hybridization affinities.
 18. The method of claim 10, wherein the clustering the plurality of sets of hybridization affinity information includes calculating mean linkage clustering of the clusters.
 19. The method of claim 18, wherein the mean linkage clustering of the probes utilizes a distance metric for differences between clusters.
 20. The method of claim 19, wherein the distance metric is a Euclidean distance or a city-block distance.
 21. The method of claim 10, further comprising displaying a tree structure of the plurality of clusters.
 22. The method of claim 21, wherein the distance between the clusters varies with the similarity of the clusters.
 23. A computer program product that detects polymorphisms in sample nucleic acid sequences, comprising: computer code that receives a plurality of sets of hybridization affinity information, each set of hybridization affinity information including hybridization affinities between a sample nucleic acid sequence and nucleic acid probes; computer code that hierarchically clusters the plurality of sets of hybridization affinity information into a plurality of clusters such that all sets of hybridization affinity information in each cluster are more similar to each other than to the sets of hybridization affinity information in another cluster; computer code that analyzes the plurality of clusters to detect if there are polymorphisms in the sample polymers; and a computer readable medium that stores the computer codes.
 24. The computer program product of claim 23, wherein the computer readable medium is selected from the group consisting of floppy disk, tape, flash memory, system memory, hard drive, and a data signal embodied in a carrier wave.